learned image compression
Joint Autoregressive and Hierarchical Priors for Learned Image Compression
Recent models for learned image compression are based on autoencoders that learn approximately invertible mappings from pixels to a quantized latent representation. The transforms are combined with an entropy model, which is a prior on the latent representation that can be used with standard arithmetic coding algorithms to generate a compressed bitstream. Recently, hierarchical entropy models were introduced as a way to exploit more structure in the latents than previous fully factorized priors, improving compression performance while maintaining end-to-end optimization. Inspired by the success of autoregressive priors in probabilistic generative models, we examine autoregressive, hierarchical, and combined priors as alternatives, weighing their costs and benefits in the context of image compression. While it is well known that autoregressive models can incur a significant computational penalty, we find that in terms of compression performance, autoregressive and hierarchical priors are complementary and can be combined to exploit the probabilistic structure in the latents better than all previous learned models. The combined model yields state-of-the-art rate-distortion performance and generates smaller files than existing methods: 15.8% rate reductions over the baseline hierarchical model and 59.8%, 35%, and 8.4% savings over JPEG, JPEG2000, and BPG, respectively. To the best of our knowledge, our model is the first learning-based method to outperform the top standard image codec (BPG) on both the PSNR and MS-SSIM distortion metrics.
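To illustrate how an entropy model translates into a bit count for the arithmetic coder, the following is a minimal sketch. It assumes each integer-quantized latent is modeled by a Gaussian with a predicted mean and scale, and that the probability of a symbol is the Gaussian mass of its unit-width quantization bin. The Gaussian form and the names `gaussian_cdf`, `bits_for_latent`, and `total_bits` are illustrative assumptions, not details stated in the abstract.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """CDF of the normal distribution N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bits_for_latent(y_hat, mu, sigma, floor=1e-9):
    """Estimated code length in bits of one integer-quantized latent.

    Assumption: the symbol probability is the Gaussian mass of the bin
    [y_hat - 0.5, y_hat + 0.5]. `floor` avoids log(0) for tiny masses.
    """
    upper = gaussian_cdf(y_hat + 0.5, mu, sigma)
    lower = gaussian_cdf(y_hat - 0.5, mu, sigma)
    return -math.log2(max(upper - lower, floor))


def total_bits(latents, mus, sigmas):
    """Sum of per-symbol code lengths; an ideal arithmetic coder approaches this."""
    return sum(bits_for_latent(y, m, s) for y, m, s in zip(latents, mus, sigmas))
```

Under this sketch, a prior that predicts the latent more accurately (mean close to the value, small scale) yields fewer bits. This is the mechanism by which richer priors, whether hierarchical or autoregressive, can reduce the rate.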
Optimizing Learned Image Compression on Scalar and Entropy-Constraint Quantization
Borzechowski, Florian, Schäfer, Michael, Schwarz, Heiko, Pfaff, Jonathan, Marpe, Detlev, Wiegand, Thomas
The continuous improvements in image compression with variational autoencoders have led to learned codecs that are competitive with conventional approaches in terms of rate-distortion efficiency. Nonetheless, taking the quantization into account during the training process remains a problem, since it produces zero derivatives almost everywhere and needs to be replaced with a differentiable approximation that allows end-to-end optimization. Though there are different methods for approximating the quantization, none of them models the quantization noise correctly, and thus they result in suboptimal networks. Hence, we propose an additional finetuning training step: after conventional end-to-end training, parts of the network are retrained on quantized latents obtained at the inference stage. For entropy-constraint quantizers like Trellis-Coded Quantization, the impact of the quantizer is particularly difficult to approximate by rounding or adding noise, as the quantized latents are interdependently chosen through a trellis search based on both the entropy model and a distortion measure. We show that retraining on correctly quantized data consistently yields additional coding gain for both uniform scalar and especially for entropy-constraint quantization, without increasing inference complexity. For the Kodak test set, we obtain average savings between 1% and 2%, and for the TecNick test set up to 2.2% in terms of Bjøntegaard-Delta bitrate.
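The mismatch between a training-time surrogate and the inference-time quantizer can be seen in a small sketch. It assumes additive uniform noise as the surrogate and plain rounding at inference; the function names and the choice of latent values in the usage note are hypothetical and do not come from the paper.

```python
import random


def hard_round(latents):
    """Inference-time quantizer: round each latent to the nearest integer.

    Its derivative is zero almost everywhere, which is why training
    typically substitutes a differentiable surrogate.
    """
    return [float(round(v)) for v in latents]


def noisy_proxy(latents, rng):
    """Training-time surrogate: add i.i.d. uniform noise from [-0.5, 0.5]."""
    return [v + rng.uniform(-0.5, 0.5) for v in latents]


def mean_abs_error(original, perturbed):
    """Average absolute perturbation introduced by a quantizer or surrogate."""
    return sum(abs(a - b) for a, b in zip(original, perturbed)) / len(original)
```

For latents clustered near an integer, for example ten thousand copies of 0.1, `hard_round` perturbs every value by exactly 0.1, whereas `noisy_proxy` perturbs them by about 0.25 on average regardless of where they lie. The surrogate therefore misstates the distortion the decoder sees at inference, which is the kind of gap that retraining on truly quantized latents is meant to close.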
Causal Context Adjustment Loss for Learned Image Compression
In recent years, learned image compression (LIC) technologies have notably surpassed conventional methods in terms of rate-distortion (RD) performance. Most current learned techniques are VAE-based with an autoregressive entropy model, which improves RD performance by utilizing the decoded causal context. However, existing methods depend heavily on a fixed, hand-crafted causal context. How to guide the autoencoder to generate a causal context that better serves the autoregressive entropy model is a question worth exploring. In this paper, we make the first attempt to investigate how to explicitly adjust the causal context with our proposed Causal Context Adjustment loss (CCA-loss).
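As a point of reference for what a fixed, hand-crafted causal context can look like, the sketch below builds a raster-scan mask of the kind used in masked convolutions. It assumes latents are decoded row by row, left to right. The function name `causal_mask` and the kernel sizes are illustrative, and this is not the context model proposed in the paper.

```python
def causal_mask(k):
    """Return a k x k mask for a raster-scan autoregressive context.

    Entries are 1 for positions decoded before the centre (all rows above,
    plus positions to the left in the centre row) and 0 otherwise, so the
    centre itself and all later positions are hidden.
    """
    c = k // 2
    return [
        [1 if (row < c or (row == c and col < c)) else 0 for col in range(k)]
        for row in range(k)
    ]
```

For `k = 3` the mask is `[[1, 1, 1], [1, 0, 0], [0, 0, 0]]`: only already-decoded neighbours contribute to the prediction of the current latent.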
Learned Image Compression and Restoration for Digital Pathology
Lee, SeonYeong, Seong, EonSeung, Lee, DongEon, Lee, SiYeoul, Cho, Yubin, Park, Chunsu, Kim, Seonho, Seo, MinKyung, Ko, YoungSin, Kim, MinWoo
Preprint, compiled April 2, 2025
SeonYeong Lee 1, EonSeung Seong 1, DongEon Lee 1, SiYeoul Lee 1, Yubin Cho 1, Chunsu Park 1, Seonho Kim 1, MinKyung Seo 1, YoungSin Ko 3, and MinWoo Kim 1,2,*
1 Department of Information Convergence Engineering, Pusan National University, Yangsan, Korea
2 School of Biomedical Convergence Engineering, Pusan National University, Yangsan, Korea
3 Seegene Medical Foundation, Seoul, Korea
The first two authors contributed equally to this work.
Abstract
Digital pathology images play a crucial role in medical diagnostics, but their ultra-high resolution and large file sizes pose significant challenges for storage, transmission, and real-time visualization. To address these issues, we propose CLERIC, a novel deep learning-based image compression framework designed specifically for whole slide images (WSIs). CLERIC integrates a learnable lifting scheme and advanced convolutional techniques to enhance compression efficiency while preserving critical pathological details. Our framework employs a lifting-scheme transform in the analysis stage to decompose images into low- and high-frequency components, enabling more structured latent representations. These components are processed through parallel encoders incorporating Deformable Residual Blocks (DRB) and Recurrent Residual Blocks (R2B) to improve feature extraction and spatial adaptability. The synthesis stage applies an inverse lifting transform for effective image reconstruction, ensuring high-fidelity restoration of fine-grained tissue structures. We evaluate CLERIC on a digital pathology image dataset and compare its performance against state-of-the-art learned image compression (LIC) models. Experimental results demonstrate that CLERIC achieves superior rate-distortion (RD) performance, significantly reducing storage requirements while maintaining high diagnostic image quality. Our study highlights the potential of deep learning-based compression in digital pathology, facilitating efficient data management and long-term storage while ensuring seamless integration into clinical workflows and AI-assisted diagnostic systems.
Keywords: Learned Image Compression, Deep Learning, Wavelet Transform, Digital Pathology, Whole Slide Image.
1 Introduction
Digital pathology images serve as fundamental data for various medical applications, playing a crucial role in cancer diagnosis, disease analysis, and treatment planning. These images are typically stored as Whole Slide Images (WSIs), which are characterized by ultra-high resolution (typically 0.25 µm/px). A single uncompressed WSI can often exceed several gigabytes in size (e.g., 20-30 GB per image), posing significant challenges in terms of storage, transmission, and computational efficiency.
Lightweight Embedded FPGA Deployment of Learned Image Compression with Knowledge Distillation and Hybrid Quantization
Mazouz, Alaa, Chaudhuri, Sumanta, Cagnanzzo, Marco, Mitrea, Mihai, Tartaglione, Enzo, Fiandrotti, Attilio
Learnable Image Compression (LIC) has shown the potential to outperform standardized video codecs in RD efficiency, prompting research into hardware-friendly implementations. Most existing LIC hardware implementations prioritize latency over RD efficiency and rely on an extensive exploration of the hardware design space. We present a novel design paradigm where the burden of tuning the design for a specific hardware platform is shifted towards model dimensioning, without compromising on RD efficiency. First, we design a framework for distilling a leaner student LIC model from a reference teacher: by tuning a single model hyperparameter, we can meet the constraints of different hardware platforms without a complex hardware design exploration. Second, we propose a hardware-friendly implementation of the Generalized Divisive Normalization (GDN) activation that preserves RD efficiency even after parameter quantization. Third, we design a pipelined FPGA configuration which takes full advantage of available FPGA resources by leveraging parallel processing and optimizing resource allocation. Our experiments with a state-of-the-art LIC model show that we outperform all existing FPGA implementations while performing very close to the original model.
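For readers unfamiliar with GDN, the sketch below shows the commonly used floating-point form of the activation applied to the channel vector at one spatial position. It is not the hardware-friendly variant the abstract refers to, and the names `gdn`, `beta`, and `gamma` are illustrative assumptions.

```python
import math


def gdn(x, beta, gamma):
    """Apply GDN across the channel vector x at a single spatial position.

    y[i] = x[i] / sqrt(beta[i] + sum_j gamma[i][j] * x[j] ** 2)

    Assumption: beta[i] > 0 and gamma[i][j] >= 0, so the denominator is
    always positive.
    """
    squares = [v * v for v in x]
    out = []
    for i, xi in enumerate(x):
        norm = beta[i] + sum(g * s for g, s in zip(gamma[i], squares))
        out.append(xi / math.sqrt(norm))
    return out
```

The square root and division per channel are the operations that tend to be costly in fixed-point hardware, which gives some intuition for why a dedicated implementation is of interest on FPGAs.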
Reviews: Joint Autoregressive and Hierarchical Priors for Learned Image Compression
Summary
This paper extends the autoencoder trained for compression of Balle et al. (2018) with a small autoregressive model. The autoencoder of Balle uses Gaussian scale mixtures (GSMs) for entropy encoding of coefficients, and encodes its latent variables as side information in the bit stream. Here, conditional Gaussian mixtures are used which additionally use neighboring coefficients as context. The authors find that this significantly improves compression performance.
Good
– Good performance (notably, state-of-the-art MS-SSIM results without optimizing directly on this metric)
– Extensive supplementary materials, including rate-distortion curves for individual images
– Well written
Bad
– Incremental, with no real conceptual contributions
– Missing related work: There is a long history of conditional Gaussian mixture models for autoregressive modeling of images, including for entropy rate estimation, that is arguably more relevant than other generative models mentioned in the paper: Domke et al. (2008), Hosseini et al. (2010), Theis et al. (2012), Uria et al. (2013), Theis et al. (2015)
GABIC: Graph-based Attention Block for Image Compression
Spadaro, Gabriele, Presta, Alberto, Tartaglione, Enzo, Giraldo, Jhony H., Grangetto, Marco, Fiandrotti, Attilio
While standardized codecs like JPEG and HEVC-intra represent the industry standard in image compression, neural Learned Image Compression (LIC) codecs represent a promising alternative. In detail, integrating attention mechanisms from Vision Transformers into LIC models has shown improved compression efficiency. However, extra efficiency often comes at the cost of aggregating redundant features. This work proposes a Graph-based Attention Block for Image Compression (GABIC), a method to reduce feature redundancy based on a k-Nearest Neighbors enhanced attention mechanism. Our experiments show that GABIC outperforms comparable methods, particularly at high bit rates, enhancing compression performance.
Region of Interest Loss for Anonymizing Learned Image Compression
Liebender, Christoph, Bezerra, Ranulfo, Ohno, Kazunori, Tadokoro, Satoshi
The use of AI in public spaces continually raises concerns about privacy and the protection of sensitive data. An example is the deployment of detection and recognition methods on humans, where images are provided by surveillance cameras. This results in the acquisition of large amounts of sensitive data, since images taken by such cameras are captured and transmitted unaltered to a server on the network. However, many applications do not explicitly require the identity of a given person in a scene; an anonymized representation that contains the person's position while preserving their context in the scene suffices. We show how using a customized loss function on regions of interest (ROI) can achieve sufficient anonymization such that human faces become unrecognizable while persons are kept detectable, by training an end-to-end optimized autoencoder for learned image compression that utilizes the flexibility of the learned analysis and reconstruction transforms for the task of mutating parts of the compression result. This approach enables compression and anonymization in one step on the capture device, instead of transmitting sensitive, non-anonymized data over the network. Additionally, we evaluate how this anonymization impacts the average precision of pre-trained foundation models on detecting faces (MTCNN) and humans (YOLOv8) in comparison to non-ANN based methods, while considering compression rate and latency.
Compressible and Searchable: AI-native Multi-Modal Retrieval System with Learned Image Compression
The burgeoning volume of digital content across diverse modalities necessitates efficient storage and retrieval methods. Conventional approaches struggle to cope with the escalating complexity and scale of multimedia data. In this paper, we propose a framework that addresses this challenge by fusing AI-native multi-modal search capabilities with neural image compression. First, we analyze the intricate relationship between compressibility and searchability, recognizing the pivotal role each plays in the efficiency of storage and retrieval systems. We then use a simple adapter to bridge the features of Learned Image Compression (LIC) and Contrastive Language-Image Pretraining (CLIP) while retaining semantic fidelity and retrieval of multi-modal data. Experimental evaluations on the Kodak dataset demonstrate the efficacy of our approach, showcasing significant enhancements in compression efficiency and search accuracy compared to existing methodologies. Our work marks a significant advancement towards scalable and efficient multi-modal search systems in the era of big data.